Small-model-assisted inference; read, notes not yet organized
[2311.15566] SpotServe: Serving Generative Large Language Models on Preemptible Instances
LLM serving on preemptible (spot) instances; possibly complementary to LoongServe
Two works from Meta at OSDI
MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale | USENIX
A scheduler paper; reportedly of very high quality
A resource-allocation work
Mooncake: Kimi’s KVCache-centric Architecture for LLM Serving
Work from Moonshot AI (Kimi)
Mooncake (4): How the mooncake's crust and filling are made; the open-source release of the Mooncake Transfer Engine and follow-up plans - Zhihu
Optimizations in the transfer layer
Market theory
Advanced Microeconomics series: General Equilibrium
Bayesian optimization
DLRover
Analysis of a paper from Ant Group
An optimization targeting different request types
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Alibaba's long-context work
[2406.17565] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Huawei's prefill-decode (PD) disaggregation work
[2405.07719] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
Jiarui Fang's work on long context, including an analysis of bandwidth requirements
hao-ai-lab/vllm-ltr: [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
Predicting when an LLM request will finish (learning-to-rank scheduling)
penghuima/awesome-serverless-papers: Collect papers about serverless computing research
A collection of serverless computing papers
Pyxis: Scheduling Mixed Tasks in Disaggregated Datacenters | IEEE Journals & Magazine | IEEE Xplore
A paper from Jin Xin's group
A PhD student in Jin Xin's group; their personal homepage has many related paper notes
Session 4 DL
Session 6 Cloud Computing
Session 11 ML Scheduling
Linear algebra
[2412.03213] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
Follow-up work to InfiniGen
[Original long-form post] 2024.10: The state of open-source LLM inference engines and common inference-optimization methods - Zhihu
Long-context sparsification work
DuoAttention
[2410.05076] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
NIPS 2024; essentially a scaled-up ("Pro Max") version of Star Attention
Session 3 Deep Learning and Training
Session 6 Serverless
Session 9 ML Serving